Creators/Authors contains: "Phothilimthana, Phitchaya Mangpo"


  1. Tensor compilers, essential for generating efficient code for deep learning models across various applications, employ tensor graph rewrites as one of their key optimizations. These rewrites optimize tensor computational graphs with the expectation of preserving semantics for tensors of arbitrary rank and size. Despite this expectation, to the best of our knowledge, there does not exist a fully automated verification system to prove the soundness of these rewrites for tensors of arbitrary rank and size. Previous works, while successful in verifying rewrites with tensors of concrete rank, do not provide guarantees in the unbounded setting. To fill this gap, we introduce TensorRight, the first automatic verification system that can verify tensor graph rewrites for input tensors of arbitrary rank and size. We introduce a core language, TensorRight DSL, to represent rewrite rules using a novel axis definition, called aggregated-axis, which allows us to reason about an unbounded number of axes. We achieve unbounded verification by proving that there exists a bound on tensor ranks, under which bounded verification of all instances implies the correctness of the rewrite rule in the unbounded setting. We derive an algorithm to compute this rank using the denotational semantics of TensorRight DSL. TensorRight employs this algorithm to generate a finite number of bounded-verification proof obligations, which are then dispatched to an SMT solver using symbolic execution to automatically verify the correctness of the rewrite rules. We evaluate TensorRight's verification capabilities by implementing rewrite rules present in XLA's algebraic simplifier. The results demonstrate that TensorRight can prove the correctness of 115 out of 175 rules in their full generality, while the closest automatic, bounded-verification system can express only 18 of these rules. (An illustrative bounded-verification sketch appears after this list.)
    Free, publicly-accessible full text available January 7, 2026
  2. Learning to predict properties of a large graph is challenging because each prediction requires knowledge of the entire graph, while the amount of memory available during training is bounded. Here we propose Graph Segment Training (GST), a general framework that uses a divide-and-conquer approach to enable learning large-graph property prediction with a constant memory footprint. GST first divides a large graph into segments and then backpropagates through only a few segments sampled per training iteration. We refine the GST paradigm by introducing a historical embedding table to efficiently obtain embeddings for segments not sampled for backpropagation. To mitigate the staleness of historical embeddings, we design two novel techniques. First, we finetune the prediction head to correct the input distribution shift. Second, we introduce Stale Embedding Dropout to drop some stale embeddings during training to reduce bias. We evaluate our complete method GST+EFD (with all the techniques together) on two large graph property prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST+EFD is both memory-efficient and fast, while offering a slight boost in test accuracy over a typical full-graph training regime. (A simplified data-flow sketch of this training scheme appears after this list.)
  3. Utilizing memory and register bandwidth in modern architectures may require swizzles (non-trivial mappings of data and computations onto hardware resources), such as shuffles. We develop Swizzle Inventor to help programmers implement swizzle programs, by writing program sketches that omit swizzles and delegating their creation to an automatic synthesizer. Our synthesis algorithm scales to real-world programs, allowing us to invent new GPU kernels for stencil computations, matrix transposition, and a finite field multiplication algorithm (used in cryptographic applications). The synthesized 2D convolution and finite field multiplication kernels are on average 1.5–3.2x and 1.1–1.7x faster, respectively, than expert-optimized CUDA kernels. (A toy sketch-and-synthesize example appears after this list.)
  4. Developing server applications that offload computation and data to a NIC accelerator is laborious because one has to explore the design space of decisions about data placement and caching; the partitioning of code and its parallelism; and communication strategies between program components across devices. We propose programming abstractions for NIC-accelerated applications that balance the ease of developing a correct application with the ability to refactor it to explore different design choices. The design space includes semantic changes as well as variations on parallelization and program-to-resource mapping. Our abstractions include logical and physical queues and a construct for mapping the former onto the latter; global per-packet state; a remote caching construct; and an interface to external application code. We develop Floem, a programming system that provides these abstractions, and show that it helps explore a space of NIC-offloading designs for real-world applications, including a key-value store and a distributed real-time data analytics system, improving throughput by 1.3–3.6x. (A hypothetical illustration of the logical-to-physical queue mapping appears after this list.)
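
The following is a hedged, illustrative sketch related to item 1 (TensorRight). It is not TensorRight's DSL or its rank-bound algorithm: a toy rewrite rule, reduce_sum(x + y) → reduce_sum(x) + reduce_sum(y) over real-valued (not floating-point) elements, is checked at a handful of concrete shapes by asking an SMT solver for a counterexample. The rule, the shape list, and all helper names are hypothetical; it assumes the z3-solver Python package.

```python
# Illustrative only: bounded verification of a toy tensor-graph rewrite with an SMT solver.
from itertools import product
from z3 import Real, Solver, Not, unsat

def symbolic_tensor(name, shape):
    """A tensor of fresh real-valued SMT variables for one concrete shape."""
    return [Real(f"{name}_{'_'.join(map(str, idx))}")
            for idx in product(*(range(d) for d in shape))]

def verify_instance(shape):
    """Bounded check of reduce_sum(x + y) == reduce_sum(x) + reduce_sum(y) at this shape."""
    x = symbolic_tensor("x", shape)
    y = symbolic_tensor("y", shape)
    lhs = sum(xi + yi for xi, yi in zip(x, y))   # reduce_sum of the elementwise add
    rhs = sum(x) + sum(y)                        # reduce_sum each operand, then add
    solver = Solver()
    solver.add(Not(lhs == rhs))                  # search for a counterexample assignment
    return solver.check() == unsat               # unsat => the rewrite holds at this shape

# A finite set of bounded instances (small ranks and sizes), chosen arbitrarily here.
shapes = [(2,), (3,), (2, 2), (2, 3), (2, 2, 2)]
assert all(verify_instance(s) for s in shapes)
```

In the system described above, a computed rank bound determines which finite set of instances suffices for unbounded correctness; here the shape list is only an arbitrary example.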
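
The next sketch relates to item 2 (Graph Segment Training): a minimal, hypothetical data-flow illustration of segment sampling, a historical embedding table, and stale-embedding dropout. It uses NumPy stand-ins (encode_segment is a placeholder for a real GNN encoder), computes no gradients, and is not the authors' implementation.

```python
# Data-flow sketch of the GST idea: refresh a few segment embeddings per step,
# reuse (possibly stale) historical embeddings for the rest.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8

def encode_segment(segment):
    # Stand-in for running a GNN encoder on one graph segment.
    return rng.normal(size=EMB_DIM)

def gst_forward(segments, history, k_backprop=2, stale_dropout=0.5):
    """Embed a large graph from its segments with a constant amount of fresh compute."""
    ids = list(range(len(segments)))
    fresh_ids = rng.choice(ids, size=k_backprop, replace=False)
    embeddings = []
    for i in ids:
        if i in fresh_ids:
            e = encode_segment(segments[i])   # would carry gradients during training
            history[i] = e                    # refresh the historical embedding table
            embeddings.append(e)
        elif i in history:
            if rng.random() > stale_dropout:  # stale-embedding dropout
                embeddings.append(history[i])
        # segments with no history entry yet are simply skipped this step
    return np.mean(embeddings, axis=0)        # pooled graph embedding for the prediction head

history = {}
segments = [f"segment_{i}" for i in range(6)]  # placeholder segments
graph_embedding = gst_forward(segments, history)
```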
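
The following toy example relates to item 3 (Swizzle Inventor) and illustrates only the sketch-and-synthesize idea: the data-layout swizzle is left as a hole, and a brute-force "synthesizer" picks a candidate mapping that satisfies a simple specification. The candidate grammar, the 8x8 tile model, and the bank-conflict spec are invented simplifications, not the paper's algorithm.

```python
# Toy synthesis: fill a swizzle "hole" by enumerating candidates against a spec.
N = 8  # tile width == number of memory banks in this toy model

CANDIDATE_SWIZZLES = {
    "identity": lambda row, col: col,
    "rotate":   lambda row, col: (col + row) % N,
    "xor":      lambda row, col: col ^ row,
}

def is_valid(swizzle):
    # Spec 1: within each row, the swizzle must be a permutation (no data is lost).
    for row in range(N):
        if len({swizzle(row, col) for col in range(N)}) != N:
            return False
    # Spec 2: reading one column across all rows must touch N distinct banks,
    # i.e., a column access is conflict-free after swizzling.
    for col in range(N):
        if len({swizzle(row, col) for row in range(N)}) != N:
            return False
    return True

def synthesize():
    """Fill the hole by enumerating candidates and checking the specification."""
    for name, swizzle in CANDIDATE_SWIZZLES.items():
        if is_valid(swizzle):
            return name, swizzle
    raise RuntimeError("no candidate satisfies the specification")

name, swizzle = synthesize()
print("synthesized swizzle:", name)  # 'identity' fails spec 2; 'rotate' and 'xor' both pass
```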
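
Finally, a hypothetical illustration related to item 4 (Floem) of the logical-versus-physical queue idea: application code targets logical queues, while a separate mapping decides how many physical queues back them. This is not Floem's actual API; every class and function name here is invented for illustration.

```python
# Sketch: many logical queues multiplexed onto a fixed set of physical queues.
from collections import deque

class PhysicalQueue:
    """Stand-in for a hardware queue shared between the CPU and the NIC."""
    def __init__(self):
        self.buffer = deque()
    def push(self, item):
        self.buffer.append(item)
    def pop(self):
        return self.buffer.popleft() if self.buffer else None

class QueueMapping:
    """Maps logical queue names onto a smaller set of physical queues."""
    def __init__(self, logical_names, num_physical):
        self.physical = [PhysicalQueue() for _ in range(num_physical)]
        # Simple static assignment; trying alternatives (e.g., hashing on a key)
        # is exactly the kind of design-space choice such an abstraction exposes.
        self.assign = {name: self.physical[i % num_physical]
                       for i, name in enumerate(logical_names)}
    def send(self, logical_name, packet):
        self.assign[logical_name].push(packet)

# One logical queue per worker core, multiplexed onto 2 physical NIC queues.
mapping = QueueMapping([f"worker_{i}" for i in range(4)], num_physical=2)
mapping.send("worker_3", {"key": "abc", "op": "GET"})
```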